The full Singularity and Snakemake pipeline is available and described here.
Methodology from trinity.
Assembled transcripts might not always fully represent properly paired reads: some transcripts are fragmented or short, so only one read of a pair may align. To assess the read composition of our assembly, we want to capture and count all reads that map to the assembled transcripts, both those that are properly paired and those that are not. Read representation is very good, with an overall realignment rate of 96.36% for the largest library (Fig. 1).
Figure 1: Read representation for library ag08 realigned to the de novo transcriptome.
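To make the counting described above concrete, here is a minimal sketch of how alignment records could be sorted into categories from their SAM FLAG values. The bit values come from the SAM specification; the category names and the functions `classify_flag` and `read_representation` are our own illustration, not part of the Trinity tooling or of the pipeline used here.

```python
from collections import Counter

# Standard SAM FLAG bits (SAM specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_flag(flag):
    """Return a category for one alignment record, or None to skip it."""
    if flag & (SECONDARY | SUPPLEMENTARY):
        return None  # count each read once, through its primary record
    if flag & UNMAPPED:
        return "unmapped"
    if flag & PAIRED and flag & PROPER_PAIR:
        return "proper_pair"
    if flag & PAIRED and flag & MATE_UNMAPPED:
        return "mate_unmapped"
    return "improper_or_single"


def read_representation(flags):
    """Count records per category and return (counts, percent mapped)."""
    counts = Counter()
    for flag in flags:
        category = classify_flag(flag)
        if category is not None:
            counts[category] += 1
    total = sum(counts.values())
    mapped = total - counts["unmapped"]
    rate = 100.0 * mapped / total if total else 0.0
    return counts, rate
```

In practice the flags would be read from the second column of the SAM output of the realignment step; the percentage returned is the share of primary records that are mapped, whatever their pairing status.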
Methodology from trinity.
One metric for evaluating the quality of a transcriptome assembly is the number of assembled transcripts that appear to be full-length or nearly full-length. A general approach is to align the assembled transcripts against all known proteins and to count the unique top-matching proteins that are aligned across more than X% of their length. In our assembly, 48% of the matched proteins are covered over more than 80% of their length (Fig. 2), which is very good.
Figure 2: Full-length transcripts by comparison with SwissProt.
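The coverage count described above can be sketched as follows. We assume, for illustration only, that each transcript's top hit has already been reduced to a tuple `(protein_id, hit_start, hit_end, protein_length)` in protein coordinates; this tuple layout and the function names are hypothetical and do not reproduce the Trinity scripts.

```python
def protein_coverage(hits):
    """Best percent-of-length coverage per protein across all top hits."""
    best = {}
    for protein_id, start, end, length in hits:
        pct = 100.0 * (end - start + 1) / length
        best[protein_id] = max(best.get(protein_id, 0.0), pct)
    return best


def count_covered(hits, min_pct):
    """Number of unique proteins aligned across more than min_pct of their length."""
    best = protein_coverage(hits)
    return sum(1 for pct in best.values() if pct > min_pct)


# Toy example: P1 is hit by two transcripts; only its best coverage counts.
hits = [
    ("P1", 1, 90, 100),
    ("P1", 1, 50, 100),
    ("P2", 11, 60, 100),
    ("P3", 1, 200, 200),
]
n_proteins = len(protein_coverage(hits))
n_full = count_covered(hits, 80.0)
```

Dividing `n_full` by `n_proteins` gives the kind of proportion reported in the text, here two proteins out of three in the toy data.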
Methodology from BUSCO.
Based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs, the BUSCO metric is complementary to technical metrics such as N50. Our BUSCO results are very good, with no missing genes and only 6 fragmented genes (Fig. 3).
Figure 3: BUSCO results for Aechmea transcripts.
Methodology from trinity.
Based on the lengths of the assembled transcriptome contigs, we can compute the conventional Nx length statistic, such that at least x% of the assembled transcript nucleotides are found in contigs of at least Nx in length. The traditional choice is N50, such that at least half of all assembled bases are in transcript contigs of at least the N50 length.
## ## Stats based on ONLY LONGEST ISOFORM per 'GENE':
## #####################################################
##
## Contig N10: 3156
## Contig N20: 2212
## Contig N30: 1595
## Contig N40: 1117
## Contig N50: 757
##
## Median contig length: 303
## Average contig: 541.20
## Total assembled bases: 169239078
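The Nx definition above translates directly into a few lines of code. The following is a minimal sketch with a function name, `nx`, chosen by us; it is not the script that produced the statistics shown above.

```python
def nx(lengths, x):
    """Smallest contig length such that contigs at least that long
    hold at least x% of all assembled bases."""
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    threshold = sum(lengths) * x / 100.0
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: 54 bases in total, so N50 is reached once 27 bases are covered.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = nx(toy_lengths, 50)  # 10 + 9 + 8 = 27, so N50 is 8
```

Because the contigs are visited from longest to shortest, Nx can only decrease as x increases, which matches the trend from N10 to N50 in the output above.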
Methodology from trinity.
The contig N50 values can often be exaggerated when an assembly program generates too many transcript isoforms, especially for the longer transcripts. An alternative to the contig Nx statistic that could be considered more appropriate for transcriptome assembly data is the ExN50 statistic. Here, the N50 statistic is computed as above but limited to the most highly expressed genes that together represent x% of the total normalized expression data. The gene expression is taken as the sum of the transcript isoform expression, and the gene length is computed as the expression-weighted mean of isoform lengths.
Figure 4: Caption.
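As an illustration of the ExN50 definition given above, the sketch below follows the steps literally: gene expression as the sum over isoforms, gene length as the expression-weighted mean of isoform lengths, then N50 over the top genes reaching x% of total expression. The input layout `(gene_id, isoform_length, expression)` and all function names are assumptions made for the example; the Trinity script may differ in detail.

```python
def gene_summary(isoforms):
    """Map each gene to (summed expression, expression-weighted mean length)."""
    expr_sum, weighted = {}, {}
    for gene, length, expr in isoforms:
        expr_sum[gene] = expr_sum.get(gene, 0.0) + expr
        weighted[gene] = weighted.get(gene, 0.0) + length * expr
    # Genes with zero expression have no defined weighted length; skip them.
    return {g: (e, weighted[g] / e) for g, e in expr_sum.items() if e > 0}


def n50(lengths):
    """Length at which half of the summed lengths is reached, longest first."""
    half = sum(lengths) / 2.0
    running = 0.0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def exn50(isoforms, x):
    """N50 over the most highly expressed genes covering x% of expression."""
    genes = sorted(gene_summary(isoforms).values(), reverse=True)
    cutoff = sum(expr for expr, _ in genes) * x / 100.0
    selected, running = [], 0.0
    for expr, length in genes:
        selected.append(length)
        running += expr
        if running >= cutoff:
            break
    return n50(selected)


# Toy example: g1 has two isoforms, g2 and g3 one each.
toy = [("g1", 1000, 60), ("g1", 500, 20), ("g2", 2000, 15), ("g3", 300, 5)]
```

With the toy data, gene g1 alone accounts for 80% of the expression, so the E80 set contains a single gene, and lower-expressed genes only enter the calculation at higher x.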
Methodology from trinity.
We can estimate the level of strand-specificity of the RNA-Seq data by aligning the reads back to the Trinity assembly and examining the distribution of RNA-Seq read (or fragment) orientations on the assembled transcripts. As expected for strand-specific RNA-Seq data (dUTP approach), we obtained almost exclusively reverse reads after realigning the largest library to the transcriptome (Fig. 5).
Figure 5: Caption.
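One simple way to summarize read orientation, sketched below, is to look only at the first read of each pair and count how often it aligns to the reverse strand of a contig. The FLAG bits are those of the SAM specification; the function `read1_reverse_fraction` is a hypothetical helper, and restricting the count to read 1 is a simplification of ours rather than the method used to draw the figure.

```python
# Standard SAM FLAG bits (SAM specification).
UNMAPPED = 0x4
REVERSE = 0x10
READ1 = 0x40
NOT_PRIMARY = 0x100 | 0x800  # secondary or supplementary


def read1_reverse_fraction(flags):
    """Fraction of mapped primary read-1 records on the reverse strand.

    Returns None when no usable read-1 record is present.
    """
    forward = reverse = 0
    for flag in flags:
        if flag & (UNMAPPED | NOT_PRIMARY) or not flag & READ1:
            continue  # skip unmapped, non-primary, and read-2 records
        if flag & REVERSE:
            reverse += 1
        else:
            forward += 1
    total = forward + reverse
    return reverse / total if total else None
```

A fraction close to 1 would correspond to the almost exclusively reverse orientation reported above, while a value near 0.5 would be expected for an unstranded library.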
Methodology from trinity.
Before proceeding to subsequent data analyses (such as differential expression), we check that the biological replicates are well correlated and investigate the relationships among samples. Any obvious discrepancies among samples and replicates, such as accidental mislabeling of sample replicates, strong outliers, or batch effects, should be identified at this stage.
Now let’s compare replicates across all samples. The replicates are more highly correlated within samples than between samples!
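The replicate comparison above rests on pairwise correlations between expression profiles. Below is a minimal sketch, assuming each sample is a list of normalized expression values in the same gene order and that values are compared on a log2(x + 1) scale; both the input layout and the choice of Pearson correlation are assumptions for illustration, not a description of the Trinity scripts.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Pairwise Pearson correlation of log2(x + 1) expression values.

    samples: dict mapping sample name to a list of expression values.
    Returns a dict keyed by (name_a, name_b).
    """
    logged = {
        name: [math.log2(value + 1) for value in values]
        for name, values in samples.items()
    }
    names = list(logged)
    return {(a, b): pearson(logged[a], logged[b]) for a in names for b in names}


# Toy example with hypothetical sample names.
toy_samples = {
    "A_rep1": [0, 1, 3, 7],
    "A_rep2": [0, 1, 3, 7],
    "B_rep1": [7, 3, 1, 0],
}
toy_matrix = correlation_matrix(toy_samples)
```

In a well-behaved experiment, entries for replicates of the same sample should be higher than entries between samples, which is the pattern the heatmaps are meant to reveal.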
Methodology from trinity.
Our current system for identifying differentially expressed transcripts relies on the edgeR Bioconductor package. We have a protocol and scripts, described below, for identifying differentially expressed transcripts and clustering transcripts according to their expression profiles. This process is somewhat interactive: both automated and manual approaches are described for refining gene clusters and examining their corresponding expression patterns.
Methodology from trinity.
SuperTranscripts provide a gene-like view of the transcriptional complexity of a gene. SuperTranscripts were originally defined by Nadia Davidson, Anthony Hawkins, and Alicia Oshlack in their publication “SuperTranscripts: a data driven reference for analysis and visualisation of transcriptomes” (Genome Biology, 2017). SuperTranscripts are useful in the context of genome-free de novo transcriptome assembly because they provide a genome-like reference for studying aspects of the gene, including differential transcript usage (a.k.a. differential exon usage), and serve as a substrate for mapping reads and identifying allelic polymorphisms.
A SuperTranscript is constructed by collapsing unique and common sequence regions among splicing isoforms into a single linear sequence. An illustration of this is shown below:
Figure 6: Caption.
Methodology from trinity.
Using SuperTranscripts, we can explore differential transcript usage (DTU). Differential transcript usage analysis is complementary to differential gene expression (DGE) and differential transcript expression (DTE) analysis. For details on how DTU, DGE, and DTE compare, see “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences” by Soneson, Love, and Robinson; F1000 2016.
Methodology from trinity.